import scanpy as sc
import scrublet as scr
import numpy as np
import os
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as snsIntroduction to scRNAseq analysis using scanpy
Workshop
Goal
Introduction to scRNAseq analysis using scanpy
Who am I?
- Irene Robles Rebollo
- BSc in Biochemistry
- MSc in Applied Biosciences and Biotechnology
- PhD in Cell Biology
- Half experimental
- Half computational
- Bioinformatician at Cosyne Therapeutics
History
| Year | Achievement |
|---|---|
| 1953 | DNA structure |
| 1985 | First PCR - DNA amplification |
| 1988 | PCR with Taq polymerase - Automated DNA amplification |
| 1993 | First qPCRs - RNA amplification |
| 1995 | Microarrays - Quantify an array gene expression using a chip |
| 2001 | First draft of the human genome |
| 2003 | First NGS DNA sequencer - High throughput DNA sequencing |
| 2006 | First RNA-seq - High throughput RNA sequencing |
| 2009 | First single-cell RNA-seq (Tang et al. 2009) |
| 2013 | Single-cell RNA-seq is declared method of the year |
Single-cell vs bulk RNAseq
| Feature | Bulk data | Single-cell data |
|---|---|---|
| Cell resolution | Average of all cells | Individual cell resolution |
| Sample representation | Vector of gene expression values | Matrix of gene expression values |
| Genomic resolution | Higher, depends on sequencing depth | Lower, depends on starting material |
| Cost | Lower | High |
| Computational requirements | Lower | Higher |
| Data size | Lower | Higher |
| Data interpretation | Simple | Complex |
Sample representation
- In bulk data, each sample is repressented by a vector, where each value is a gene.
- In single cell data, each sample is a matrix, where each row is a gene and each column is a cell.
\[\begin{align} Bulk &= \begin{bmatrix} gene_{1} \\ gene_{2} \\ gene_{3}\\ \vdots \\ gene_{n} \end{bmatrix} \\ \\ Single-cell &= \begin{bmatrix} gene_1, cell_1 & gene_1, cell_2 & gene_1, cell_3 & \dots & gene_1, cell_m \\ gene_2, cell_1 & gene_2, cell_2 & gene_2, cell_3 & \dots & gene_2, cell_m \\ gene_3, cell_1 & gene_3, cell_2 & gene_3, cell_3 & \dots & gene_3, cell_m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ gene_n, cell_1 & gene_n, cell_2 & gene_n, cell_3 & \dots & gene_n, cell_m \end{bmatrix} \end{align}\]
Scanpy vs Seurat
- Scanpy is a Python package for single-cell analysis (Wolf, Angerer, and Theis 2018)
- Seurat is an R package for single-cell analysis (Satija et al. 2015)
- Both are:
- User-friendly tools for single-cell analysis
- Open source
- Well-documented
- Widely-used
- Choice depends on:
- Language preference
- Team expertise
- Integration with downstream analysis
- Speed and memory requirements
Hint: A good bioinformatician is not restricted by language. You can use R in Python can be done using the
rpy2package. And Python can be use within R usingreticulate.
Scale of scRNAseq data
AnnData object
Set up
- Install Miniconda
- Create a new environment
conda create -n myscanpy python=3.10
conda activate myscanpy
pip install -r requirements.txtImport libraries
Scanpy setttings
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')scanpy==1.9.8 anndata==0.10.5.post1 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.0 scikit-learn==1.4.0 statsmodels==0.14.1 igraph==0.11.3 pynndescent==0.5.11
Download data
mkdir data
cd data
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE190nnn/GSE190622/suppl/GSE190622%5Fcount%5Fmatrix%5FAnnotated.csv.gzLoad tests datasets
Single-cell SMART-seq data from mouse macrophages (Robles-Rebollo et al. 2022)
mf = pd.read_csv("data/GSE190622_count_matrix_Annotated.csv.gz", index_col=0)
adata = sc.AnnData(X=mf.T)
adata.obs["genotype"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[0])
adata.obs["timepoint"] = adata.obs_names.to_series().apply(lambda x: x.split("_")[1])
adata.obs["sample"] = adata.obs["genotype"] + "_" + adata.obs["timepoint"]
adata.var_names_make_unique()
adataAnnData object with n_obs × n_vars = 1362 × 18429
obs: 'genotype', 'timepoint', 'sample'
Doublets
- A doublet is a single-cell transcriptome that has been generated by two cells
- They have often have a higher read and gene count
- They can produce confounding results
How do we fight doublets: - Experiment design - Computationally
Scrublet
Scrublet finds doublets based os simulations (Wolock, Lopez, and Klein 2019)
scrub = scr.Scrublet(adata.X, expected_doublet_rate=0.06)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
scrub.plot_histogram()Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.41
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 25.8%
Overall doublet rate:
Expected = 6.0%
Estimated = 0.3%
Elapsed time: 26.6 seconds
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
Adjust threshold
predicted_doublets = scrub.call_doublets(threshold=0.2)
scrub.plot_histogram()Detected doublet rate = 0.7%
Estimated detectable doublet fraction = 50.8%
Overall doublet rate:
Expected = 6.0%
Estimated = 1.4%
(<Figure size 640x240 with 2 Axes>,
array([<Axes: title={'center': 'Observed transcriptomes'}, xlabel='Doublet score', ylabel='Prob. density'>,
<Axes: title={'center': 'Simulated doublets'}, xlabel='Doublet score', ylabel='Prob. density'>],
dtype=object))
Visualise doublets
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
scrub.plot_embedding('UMAP', order_points=True);Filter out doublets
adata.obs["scrublet_doublet"] = predicted_doublets
adata = adata[adata.obs["scrublet_doublet"].apply(lambda x: x is False)]Preprocessing
Highest expressing genes
Look for suspects: MALAT1, mitochondrial genes, ribosomal genes, componenets of the cytoskeleton, etc.
sc.pl.highest_expr_genes(adata, n_top=20)normalizing counts per cell
finished (0:00:00)
Quality metrics
sc.pp.calculate_qc_metrics computes quality control metrics for each cell. - Number of counts per cell - Number of genes per cell - Percentage of counts that come from mitochondrial genes.
# Mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True, rotation = 90)Different sampes might require different thresholds
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True, groupby='sample', rotation = 90)sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', color='sample')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='sample')In this case, cells from different samples are reasonably homogeneous. However, that is not always the cells_per_paper.ipynb
Filtering
Filter cells based on quality metrics
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=5)
adata= adata[adata.obs.n_genes_by_counts < 5000, :]
adata= adata[adata.obs.pct_counts_mt < 15, :]filtered out 1 genes that are detected in less than 5 cells
Normalisation: size normalisation and log transformation
For simplicity, we will use the simplest method: library size normalisation and log transformation, but there are others Hafemeister and Satija (2019).
Steps:
- Transcript length normalisation:
- Adjusts for differences in transcript length between genes
- Divides the counts in each cell by the length of the transcript
- Not necessary if you sequence a fixed region of the transcript ( 3’ or 5’ in 10X, our case)
- Library size normalisation:
- Adjusts for differences in sequencing depth between cells
- Divides the counts in each cell by the total counts in that cell and multiplies by a scale factor (e.g. 10,000)
sc.pp.normalize_total(adata, target_sum=1e4)normalizing counts per cell
finished (0:00:00)
- Log transformation:
- Logarithm of the normalised counts
- Makes the data more normally distributed
sc.pp.log1p(adata)Highly variable genes
- Genes that show the most variation across cells
- Often the most informative genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)extracting highly variable genes
finished (0:00:00)
--> added
'highly_variable', boolean vector (adata.var)
'means', float vector (adata.var)
'dispersions', float vector (adata.var)
'dispersions_norm', float vector (adata.var)
Dimensionality reduction
Goal
- Reducing the number of input variables or features in a dataset.
Why
- Noise Reduction: Focus on the most variance-contributing features, thus enhancing the signal-to-noise ratio.
- Data Visualization: We can plot in 2 or 3 dimensions
- Aid clustering and clasification
- Computational efficiency
Dimensionality reduction techniques
- Principal Component Analysis (PCA): PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
- t-Distributed Stochastic Neighbor Embedding (t-SNE): Popular dimensionality reduction that foccusses on the preservation on local structures
- Uniform Manifold Approximation and Projection (UMAP): Popular dimensionality reduction that foccusses on the preservation on global structures
- Autoencoders: Deep learning-based autoencoders can learn to compress single-cell data into a lower-dimensional representation in an unsupervised manner, capturing nonlinear relationships in the data.
Prepare data for dimensionality reduction
The .raw attribute freezes the state of the AnnData object for later use. We can recumerate it by calling .raw.to_adata().
adata.raw = adataFilter by highly variable genes
adata= adata[:, adata.var.highly_variable]
adataView of AnnData object with n_obs × n_vars = 1277 × 3193
obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'sample_colors', 'log1p', 'hvg'
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])regressing out ['total_counts', 'pct_counts_mt']
finished (0:00:02)
Scale each gene to unit variance. Clip values exceeding standard deviation to 10.
sc.pp.scale(adata, max_value=10)Principal component analysis (PCA)
PCA is often the first step in dimensionality reduction for single-cell data. It identifies the principal components that capture the most variance in the data, reducing its dimensionality while preserving its structure as much as possible.
sc.tl.pca(adata, svd_solver='arpack')computing PCA
on highly variable genes
with n_comps=50
finished (0:00:52)
sc.pl.pca(adata, color='sample')Varinace explained by each component:
sc.pl.pca_variance_ratio(adata, log=True)UMAP and t-SNE
First, we need to compute the neighbourhood graph:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)computing neighbors
using 'X_pca' with n_pcs = 40
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
Compute UMAP:
sc.tl.umap(adata)computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm) (0:00:01)
Compute t-SNE:
sc.tl.tsne(adata)computing tSNE
using 'X_pca' with n_pcs = 50
using sklearn.manifold.TSNE
finished: added
'X_tsne', tSNE coordinates (adata.obsm) (0:00:03)
Plot UMAP:
sc.pl.umap(adata, color='sample', use_raw=False)Plot t-SNE:
sc.pl.tsne(adata, color='sample', use_raw=False)Clustering the neighborhood graph
As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
sc.tl.leiden(adata)
sc.pl.umap(adata, color='leiden', use_raw=False)running Leiden clustering
finished: found 8 clusters and added
'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
Finding marker genes
Let us compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)ranking genes
finished: added to `.uns['rank_genes_groups']`
'names', sorted np.recarray to be indexed by group ids
'scores', sorted np.recarray to be indexed by group ids
'logfoldchanges', sorted np.recarray to be indexed by group ids
'pvals', sorted np.recarray to be indexed by group ids
'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:01)
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | Ccl5 | Ctsd | Mapkapk2 | Lgals3 | Sepp1 | Tnfsf9 | Tubb5 | Gm12155 |
| 1 | Il1rn | Laptm5 | Ebi3 | Ccnd1 | Tnfsf9 | Cxcl2 | Tuba1b | Gm42836 |
| 2 | Isg15 | Lamp1 | Il1b | Ifi27l2a | Nfkbiz | Tnf | Top2a | Gm11942 |
| 3 | Saa3 | Gm42418 | Serp1 | Adam8 | Cxcl1 | Ccl4 | Hist1h2ap | Gm13341 |
| 4 | Nos2 | Gas6 | Ehd1 | Lpl | Serinc3 | Id2 | Hmgb2 | Rplp1 |
During the course of this analysis, the AnnData accumlated the following annotations.
adataAnnData object with n_obs × n_vars = 1277 × 3193
obs: 'genotype', 'timepoint', 'sample', 'scrublet_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'n_genes', 'leiden'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'sample_colors', 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'tsne', 'leiden', 'leiden_colors', 'rank_genes_groups'
obsm: 'X_pca', 'X_umap', 'X_tsne'
varm: 'PCs'
obsp: 'distances', 'connectivities'
Inspirations
- Analysis of single cell RNA-seq data
- Course from University of Cambridge Bioinformatics training unit
- In R
- Single cell study database
- Scanpy tutorials